str: Implement a faster Chars iterator for &str #15638
Conversation
Wow, 20-30% 👍

Just to be clear, there's no way to get similar speed in safe(r) code using

It may well be possible, @huonw. An initial prototype doesn't beat the current libstd implementation, though; I'll come back to that tomorrow. Here we have to ask: what's the cheapest way to do something like `it.next().unwrap()`?

```rust
pub struct Citems<'a> {
    iter: std::slice::Items<'a, u8>
}

impl<'a> Iterator<char> for Citems<'a> {
    #[inline]
    fn next(&mut self) -> Option<char> {
        #[inline(never)]
        fn decode_multibyte<'a>(fst: u8, it: &mut std::slice::Items<'a, u8>)
            -> Option<char>
        {
            let w = UTF8_CHAR_WIDTH[fst as uint];
            let mut ch = utf8_first_byte!(fst, w as uint);
            ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
            if w > 2 {
                ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
                if w > 3 {
                    ch = utf8_acc_cont_byte!(ch, it.next().map_or(0, |x| *x));
                }
            }
            // The accumulated value is a valid scalar value, so the
            // u32 -> char transmute is sound.
            unsafe {
                Some(transmute(ch))
            }
        }

        match self.iter.next() {
            Some(&next_byte) => {
                if next_byte < 128 {
                    Some(next_byte as char)
                } else {
                    decode_multibyte(next_byte, &mut self.iter)
                }
            }
            None => None
        }
    }
}
```
This is awesome, thanks for optimizing this! I, like @huonw, am also interested in whether this could be done with a bytes iterator; it seems like it should be able to compile down to the same thing.
I agree -- but I haven't been able to write an implementation based on the slice iterator that performs as well. My understanding is simplified, but I tried to put the multibyte case in the slow path and make the ASCII-only case fast; the slice iterator still doesn't perform as well as the pointer-arithmetic-based chars iterator, especially not on ASCII-dominated text. Is this related to #11751?

My playground code for comparing the three iterator implementations is here; the slice-iterator-based implementation is on this line. Benchmarks from the playground code (wikitext is decoding the En Wikipedia front page source):
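To make the fast/slow-path split concrete, here is a sketch in present-day Rust syntax of how the lead byte alone determines the sequence width. The function name is illustrative, not the libstd `UTF8_CHAR_WIDTH` table used above; the point is that the ASCII case needs only a single comparison.

```rust
// Sketch (modern Rust); `utf8_char_width` is an illustrative name,
// not the libstd lookup table discussed in this thread.
fn utf8_char_width(first_byte: u8) -> usize {
    match first_byte {
        0x00..=0x7F => 1, // ASCII: the fast path needs only this comparison
        0xC2..=0xDF => 2, // 2-byte sequence
        0xE0..=0xEF => 3, // 3-byte sequence
        0xF0..=0xF4 => 4, // 4-byte sequence
        _ => 0,           // continuation or invalid lead byte
    }
}

fn main() {
    assert_eq!(utf8_char_width(b'a'), 1);
    assert_eq!(utf8_char_width("é".as_bytes()[0]), 2); // lead byte 0xC3
    assert_eq!(utf8_char_width("€".as_bytes()[0]), 3); // lead byte 0xE2
    assert_eq!(utf8_char_width("𝄞".as_bytes()[0]), 4); // lead byte 0xF0
}
```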
Yes, #11751 will get in the way; theoretically rust-lang/llvm#14 will improve it. (If you felt like it, you could manipulate your src/llvm submodule to include that to test; I'm not really sure how easy that is to do.)
Updated comment because of a major thinko -- nothing seems to be working the way I thought it was.
Now I've managed to compile a rustc that knows about

The result is, no improvement.
I'm a little uneasy adding more unsafe iteration when it already exists for vectors (duplication of logic). Is there a performance win for this rewrite using safe iteration over the previous code, or is it only a win using unsafe code?
Duplication of logic is a good argument against, but in the end we need to provide a great Chars iterator. I don't feel that this code is particularly risky in any way; Rust's string handling is mature enough now that we don't have lots of holes to slip invalid encoded data through libstd APIs anymore. The implementation using the slice iterator gives anywhere from no improvement to a few percent by itself; we could wait to see if more code generation improvements make up for the rest. It's not even important to merge that now, but it would be cool if @zwarich could say whether his improvements are going to help us.
Perhaps we could land the other optimizations and hold off on the unsafe pointers until rust-lang/llvm#14 is merged into master?
I did test those changes already, but if there are more it would be very interesting to see. It's not unreasonable. I should complete the implementation using just the slice iterator (the chars().rev() impl is missing), and then we can see if it's worth adding on its own. I don't feel a reimplementation of the slice iterator is that scary; it's a great model of a useful pattern itself -- it will come back when users want to implement "slices with stride > 1", for example.
Tons of elbow grease creates a slice iterator decoder that's faster than both the present Chars iterator and the pointer arithmetic Chars iterator in all my test cases. It's not pretty, though. Will update PR tomorrow.

```rust
#[deriving(Clone)]
pub struct CharItems<'a> {
    iter: slice::Items<'a, u8>
}

#[inline]
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => 0,
    }
}

impl<'a> Iterator<char> for CharItems<'a> {
    #[inline]
    fn next(&mut self) -> Option<char> {
        fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
            // NOTE: Performance is very sensitive to the exact formulation here
            // Decode from a byte combination out of: [[[x y] z] w]
            let CONT_MASK = 0x3F; // continuation byte mask
            let init = utf8_first_byte!(x, 2);
            let mut ch;
            let y = unwrap_or_0(it.next());
            ch = init << 6 | (y & CONT_MASK) as u32;
            if x >= 0xE0 {
                let z = unwrap_or_0(it.next());
                let y_z = (((y & CONT_MASK) as u32) << 6) | (z & CONT_MASK) as u32;
                ch = init << 12 | y_z;
                if x >= 0xF0 {
                    let w = unwrap_or_0(it.next());
                    ch = (init & 7) << 18 | y_z << 6 | (w & CONT_MASK) as u32;
                }
            }
            // The accumulated value is a valid scalar value, so the
            // u32 -> char transmute is sound.
            unsafe {
                mem::transmute(ch)
            }
        }

        match self.iter.next() {
            None => None,
            Some(&next_byte) => {
                if next_byte < 128 {
                    Some(next_byte as char)
                } else {
                    Some(decode_multibyte(next_byte, &mut self.iter))
                }
            }
        }
    }
}
```
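For readers following along, the bit manipulation above can be re-derived in safe, present-day Rust (names and structure here are illustrative, not the PR's code) and checked against `char::from_u32`: each continuation byte contributes six value bits, accumulated below the bits taken from the lead byte.

```rust
// Low six value bits of a UTF-8 continuation byte (10xxxxxx).
const CONT_MASK: u8 = 0x3F;

// Decode one complete UTF-8 sequence; illustrative helper, not libstd code.
fn decode(bytes: &[u8]) -> Option<char> {
    let x = bytes[0] as u32;
    let ch = match bytes.len() {
        1 => x,
        2 => (x & 0x1F) << 6 | (bytes[1] & CONT_MASK) as u32,
        3 => (x & 0x0F) << 12
            | ((bytes[1] & CONT_MASK) as u32) << 6
            | (bytes[2] & CONT_MASK) as u32,
        4 => (x & 0x07) << 18
            | ((bytes[1] & CONT_MASK) as u32) << 12
            | ((bytes[2] & CONT_MASK) as u32) << 6
            | (bytes[3] & CONT_MASK) as u32,
        _ => return None,
    };
    char::from_u32(ch)
}

fn main() {
    // Check one sequence of each width against the built-in decoder.
    for s in ["a", "é", "€", "𝄞"] {
        let c = s.chars().next().unwrap();
        assert_eq!(decode(s.as_bytes()), Some(c));
    }
}
```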
What worries me here is the result for the test called

Before:

```
test str::bench::char_indicesator      ... bench: 125 ns/iter (+/- 293)
test str::bench::char_indicesator_rev  ... bench: 189 ns/iter (+/- 26)
test str::bench::char_iterator         ... bench: 152 ns/iter (+/- 296)
test str::bench::char_iterator_ascii   ... bench: 302 ns/iter (+/- 485)
test str::bench::char_iterator_for     ... bench: 121 ns/iter (+/- 2)
test str::bench::char_iterator_rev     ... bench: 418 ns/iter (+/- 357)
test str::bench::char_iterator_rev_for ... bench: 195 ns/iter (+/- 413)
```

After:

```
test str::bench::char_indicesator      ... bench: 126 ns/iter (+/- 2)
test str::bench::char_indicesator_rev  ... bench: 137 ns/iter (+/- 15)
test str::bench::char_iterator         ... bench: 113 ns/iter (+/- 11)
test str::bench::char_iterator_ascii   ... bench: 310 ns/iter (+/- 14)
test str::bench::char_iterator_for     ... bench: 179 ns/iter (+/- 3)
test str::bench::char_iterator_rev     ... bench: 107 ns/iter (+/- 8)
test str::bench::char_iterator_rev_for ... bench: 167 ns/iter (+/- 11)
```
It's the inlining decisions -- inlining even the multibyte case makes the difference -- how does that look with the update? Benching on a laptop always has you wondering: is that faster now because of some turbo boost stuff I can't control? Either way, the for loop test case sped up to the level of the others, using inline:

```
test str::bench::char_indicesator      ... bench: 85 ns/iter (+/- 3)
test str::bench::char_indicesator_rev  ... bench: 82 ns/iter (+/- 6)
test str::bench::char_iterator         ... bench: 85 ns/iter (+/- 2)
test str::bench::char_iterator_ascii   ... bench: 317 ns/iter (+/- 2)
test str::bench::char_iterator_for     ... bench: 87 ns/iter (+/- 3)
test str::bench::char_iterator_rev     ... bench: 79 ns/iter (+/- 1)
test str::bench::char_iterator_rev_for ... bench: 68 ns/iter (+/- 2)
```
Updating main PR text in case this is merged.
Test using `for ch in s.chars() { black_box(ch) }` to have a test that should force the iterator to run its full decoding computations.
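In present-day Rust that benchmark body could look like the following sketch, using the now-stable `std::hint::black_box` (the sample string is an arbitrary example, not the PR's test data):

```rust
use std::hint::black_box;

fn main() {
    // Arbitrary mixed ASCII/multibyte sample text.
    let s = "abcdéfghîjklmnôpqrstüvwxyz";
    let mut n = 0;
    for ch in s.chars() {
        // black_box keeps the optimizer from eliding the decode work.
        black_box(ch);
        n += 1;
    }
    assert_eq!(n, 26);
}
```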
Re-use the vector iterator to implement the chars iterator. The iterator uses our guarantee that the string contains valid UTF-8, but its only unsafe code is transmuting the decoded u32 into char.
```rust
fn decode_multibyte<'a>(x: u8, it: &mut slice::Items<'a, u8>) -> char {
    // NOTE: Performance is very sensitive to the exact formulation here
    // Decode from a byte combination out of: [[[x y] z] w]
    let cont_mask = 0x3F; // continuation byte mask
```
Could this be a `static`?
Test iterating (decoding) every codepoint.
```rust
    match opt {
        Some(&byte) => byte,
        None => 0,
    }
}
```

Was this more performant than `foo.unwrap_or(0)` (the same width of characters, basically)?
Thank you for the comments! I didn't like the look of `*foo.unwrap_or(&0)`; I think this is better.
Awesome work @blake2-ppc! It's always nice to hear that unsafe code isn't necessary when safe code suffices.
```rust
static TAG_CONT_U8: u8 = 128u8;
/// Mask of the value bits of a continuation byte
static CONT_MASK: u8 = 0x3Fu8;
```

These could be binary constants too, `0b0011_1111` and `0b1000_0000`.
Fixed
Could you add the benchmarks (or at least mention loose percentages) to a commit message? (So that someone history-diving can understand the reasons this code is designed like this.)
Thanks to comments from @huonw, clarify decoding details and use statics for important constants for UTF-8 decoding. Convert some magic numbers scattered in the same file to use the statics too.
Thanks to comments from @alexcrichton, write the next/next_back function bodies without nested functions in a more top-to-bottom flow style. Also improve comment style and motivate the unsafe blocks with comments.
Commits pushed to incorporate the excellent suggestions. Benchmarking results added to PR message (will be in merge commit log).
In-tree benchmarking is very time consuming, but I felt it was necessary to justify the change. The development benchmark used the html source of en.wikipedia.org/wiki/Main_Page, a typical almost-ASCII-only document with small chunks of UTF-8. Those benchmarks are even more convincing, for some reason. (Maybe code size? The size of the .next() method maybe pays off in a long-running simple loop like this.)
```rust
fn unwrap_or_0(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => 0,
```

I idly wonder if this is a place where we could use a (hypothetical) `unreachable` intrinsic and possibly even gain a little more speed, since the `str` invariant guarantees this code will never be hit.
Nice thought. I didn't want to put `fail!()` or similar in there; it would add failure to an otherwise failure-less loop.
I forgot, I did try

```rust
unsafe {
    let ptr: *const u8 = mem::transmute(opt);
    *ptr
}
```

as well, but it didn't improve anything much. @thestinger is a wizard and would know why, but I don't.
Well, the assembly/IR for that is certainly really nice: http://is.gd/MLxXeK

```llvm
; Function Attrs: nounwind readonly uwtable
define i8 @deref(i8* nocapture readonly) unnamed_addr #0 {
entry-block:
  %1 = load i8* %0, align 1
  ret i8 %1
}
```

This branch will be really easy to predict, since the `None` arm is never taken, so maybe that is making it essentially costless?
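Rust did eventually gain such an intrinsic, stabilized as `std::hint::unreachable_unchecked`. A sketch of the idea follows; the helper name is hypothetical, and the unsafe contract is exactly the `str` invariant discussed above: the caller must guarantee the byte exists.

```rust
// Hypothetical variant of unwrap_or_0: tell the optimizer the None arm
// cannot happen, so the branch can be removed entirely.
// SAFETY: callers must ensure `opt` is always Some.
#[inline]
unsafe fn unwrap_or_unreachable(opt: Option<&u8>) -> u8 {
    match opt {
        Some(&byte) => byte,
        None => std::hint::unreachable_unchecked(),
    }
}

fn main() {
    // The two bytes of "é"; after consuming the lead byte, the str
    // invariant guarantees the continuation byte is present.
    let bytes = [0xC3u8, 0xA9];
    let mut it = bytes.iter();
    it.next();
    let b = unsafe { unwrap_or_unreachable(it.next()) };
    assert_eq!(b, 0xA9);
}
```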
Only one uint is needed to keep track of the offset from the original full string.
Reimplement the string slice's `Iterator<char>` by wrapping the already efficient slice iterator. The iterator uses our guarantee that the string contains valid UTF-8, but its only unsafe code is transmuting the decoded `u32` into `char`. Benchmarks suggest that the runtime of `Chars` benchmarks is reduced by up to 30%, and the runtime of `Chars` reversed by up to 60%.

```
BEFORE
test str::bench::char_indicesator      ... bench: 124 ns/iter (+/- 1)
test str::bench::char_indicesator_rev  ... bench: 188 ns/iter (+/- 9)
test str::bench::char_iterator         ... bench: 122 ns/iter (+/- 2)
test str::bench::char_iterator_ascii   ... bench: 302 ns/iter (+/- 41)
test str::bench::char_iterator_for     ... bench: 123 ns/iter (+/- 4)
test str::bench::char_iterator_rev     ... bench: 189 ns/iter (+/- 14)
test str::bench::char_iterator_rev_for ... bench: 177 ns/iter (+/- 4)

AFTER
test str::bench::char_indicesator      ... bench: 85 ns/iter (+/- 3)
test str::bench::char_indicesator_rev  ... bench: 82 ns/iter (+/- 2)
test str::bench::char_iterator         ... bench: 100 ns/iter (+/- 3)
test str::bench::char_iterator_ascii   ... bench: 317 ns/iter (+/- 3)
test str::bench::char_iterator_for     ... bench: 86 ns/iter (+/- 2)
test str::bench::char_iterator_rev     ... bench: 80 ns/iter (+/- 6)
test str::bench::char_iterator_rev_for ... bench: 68 ns/iter (+/- 0)
```

Note: Branch name is no longer indicative of the implementation.
I repeated the last benchmark suite (iterating over the wikipedia main page) using my criterion library, and got a consistent 20%+ improvement in all the benchmarks.
@blake2-ppc Really nice speed-ups! 👍

Thanks for running the benchmark, that is cool!